December 11, 2015



Data Description

Raw Data Spreadsheet

Raw data given as excel file with ~300 tabs, one for each patient, like this:

Patient Data

Along with timeseries measurements, the following static covariates were given for each patient:

  • Age and Sex
  • GCS (Glasgow Coma Scale) - Integer score from 3 (deep unconsciousness) to 15 (normal)
  • Marshall Clasasification - Similar to GCS; Integer from 1 to 6 (lower is better)
  • Estimated Initial Time of Injury

As well as an outcome score called "GOS" measured at 3, 6, 12, and 24 months:

  • GOS - 1=Dead, 2-3=Severely Disabled, 4-5=Good Outcome

Data Preparation

For the sake of modeling, the following assumptions were made:

  • Only measurments within 48 hours of initial injury were considered
  • Patients with no GCS or Marshall score were excluded
  • Patients with less than 8 hours of PbtO2 measurements were also excluded
  • Only the 3 month GOS outcome was used, using the 6 month GOS where 3 month not available

There were 339 patients in raw data but only 268 in filtered dataset.

Note: Of 268 patients in filtered set, 18 had missing 3 month GOS

Descriptive Stats

Static Variable Distributions

Frequency of non-time-dependent values:

Comparing Variables to Outcome

Relationships between static variables and outcomes are pretty clear:

Blood Gas Measurements

Timeseries measurements for 4 random patients:

Blood Gas Measurements

Some observations on timeseries values:

  • Some quantities are sparse (measured more often for sicker people)
  • There are weak correlations between several pairs of variables
  • The amount of time between injury and first measurement can vary a lot:

Global Trends

Modeling

Data Input

A big challenge involves working around the fact that this model for the data below is not valid:

glm(outcome ~ age + pbto2, family='binomial')
age pbto2 uid tsi_min outcome
29 0 636 540 0
29 1 636 600 0
29 2 636 660 0
24 18 656 720 1
24 16 656 1080 1

Data Input

Repeating the non-time-dependent variables like this leads to confidence intervals on coefficients that are unrealistically small (and not valid).

Possible Workarounds:

  1. Take the mean of each timeseries and use that as a single variable
    This doesn't work – all timeseries variables become insignificant
  2. Take multiple distribution statistics (e.g. percentiles) and use those all as variables
    This works better, but has a relatively useless interpretation
  3. Try to find "threshold" values where time spent above or below those values become variables
    This is best, though it introduces thresholds as a variable

All of the models that follow take approach 3, though the way they deal with the thresholds will differ.

Finding Good Thresholds

As a first attempt at modeling, all timeseries variables were separated into ranges based on known guidelines for safe values (Primarily from this Lab Value Guide).

For example, the PaCO2 variable was split into three new variables:

  • paco2_0_35 - % of time spent by patient w/ PaCO2 value in [0, 35)
  • paco2_35_45 - % of time spent by patient w/ PaCO2 value in [35, 45)
  • paco2_45_inf - % of time spent by patient w/ PaCO2 value in [45, +Inf)

For all variables, the known "safe range" was then dropped from the model to avoid linear dependence (percentages across all variables would add up to 1). This leaves only variables indicating time spent at dangerous levels.

Binary Outcome Dataset

Modeling input dataset; includes static covariates and timeseries summaries:
##         variable  mean    sd min   max     type
## 1    icp1_20_inf  0.04  0.13   0  1.00 GasRange
## 2     paco2_0_35  0.44  0.28   0  1.00 GasRange
## 3   paco2_45_inf  0.05  0.12   0  0.80 GasRange
## 4      pao2_0_30  0.04  0.09   0  0.43 GasRange
## 5   pao2_100_inf  0.08  0.15   0  1.00 GasRange
## 6     pbto2_0_20  0.31  0.31   0  1.00 GasRange
## 7  pbto2_100_inf  0.00  0.02   0  0.19 GasRange
## 8     pha_0_7.35  0.10  0.18   0  1.00 GasRange
## 9   pha_7.45_inf  0.30  0.27   0  1.00 GasRange
## 10           age 37.75 16.37  16 77.00   Static
## 11           gcs  5.90  1.46   3  8.00   Static
## 12      marshall  3.06  1.47   1  6.00   Static
## 13           sex  0.78  0.42   0  1.00   Static

Binary GLM Results

Best model selected through exhaustive AIC search:

glmulti('gos ~ .', family='binomial', level=1)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.484 0.198 -7.506 0.000
age -0.639 0.195 -3.275 0.001
gcs 0.542 0.184 2.939 0.003
paco2_45_inf 0.509 0.175 2.916 0.004
pao2_0_30 -0.430 0.202 -2.124 0.034
icp1_20_inf -0.621 0.303 -2.049 0.040
pbto2_0_20 -0.347 0.182 -1.909 0.056
pbto2_100_inf -0.777 0.432 -1.797 0.072
marshall -0.327 0.197 -1.658 0.097

Parameter Averages

Parameter estimates averaged over all models (main effects only):
Estimate Nb models Importance +/- (alpha=0.05)
paco2_0_35 -0.045 37.000 0.294 0.190
pao2_100_inf -0.054 37.000 0.325 0.204
sex 0.067 37.000 0.342 0.238
pha_0_7.35 -0.115 44.000 0.427 0.353
pha_7.45_inf -0.115 47.000 0.454 0.336
marshall -0.241 65.000 0.701 0.450
pbto2_0_20 -0.269 69.000 0.780 0.420
pao2_0_30 -0.423 94.000 0.969 0.416
icp1_20_inf -0.620 95.000 0.977 0.622
pbto2_100_inf -0.748 99.000 0.996 0.844
(Intercept) -1.486 100.000 1.000 0.394
age -0.690 100.000 1.000 0.397
gcs 0.551 100.000 1.000 0.367
paco2_45_inf 0.521 100.000 1.000 0.394

Parameter Average Plot

Plot of 95% Confidence Intervals over all models, sized by importance:

Nonlinear Modeling

An Improved Model

A potential improvement on the linear models would involve not needing to set "threshold" ranges for blood gas measurements manually. The most basic version of this kind of model would involve a single threshold and a step function where having blood gas values on one side of that function has a different effect that the other side.

A generalization of this could allow more than one threshold as well as the possibility for gradual shifts between regions of values where effects differ.

Nonlinear Model

A modified logistic model:

\[ logit(y_i) = \alpha + \beta \cdot X_i + f(G_{ij}) \]

where

\[ X_i = [Gender_i, Age_i, CommaScore_i, MarshallScore_i] \]

and

\[ f(G_i) = \frac{1}{n_i} \sum_j{ \frac{c_1}{1 + e^{-c_2(G_{ij} - c_3)}} + \frac{c_4}{1 + e^{-c_5(G_{ij} - c_6)}} } \] \[ n_i = \text{ length of timeseries for patient }i \]

The "double logistic" function definition for \(f\) allows for flexiblity in thresholds and graduated changes.

What Does Nonlinear Effect Mean?

This is the format that all nonlinear effects will be shown in:

Double Logistic Function Examples

Implementation

MCMC sampling model used: Model on Github

These are effect functions drawn from the priors in the model to show possibilities and biases the model begins with:

Sample Size Effects (Fully Simulated Data)

Function Forms on Semi-Simulated Data

Parameter recovery after hard coding coefficient / function values for the actual (rather than simulated) data:

Results - Intracranial Pressure levels

Results (PaCO2 levels)

Results (pH levels)

Results - PaO2 levels

Results - PbtO2 levels

Done